make event dispatching non-blocking #6762

base: develop

Conversation
@@ -0,0 +1,322 @@
use std::path::PathBuf;
Hey! Please add the copyright header to each source file. Thanks!
Ah, good call, will do. Is "Stacks Open Internet Foundation" still the correct copyright holder?
Also, there's a whole bunch of files that lack that header, I'm wondering if there's a good way to automate this.
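One way to automate it would be a small one-off tool. This is only a sketch of the idea: the header text, the `MARKER` string, and the function names are placeholders I made up, not anything the project actually uses.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Placeholder header and detection marker -- both are assumptions, not the
// project's actual license text or tooling.
const HEADER: &str = "// Copyright (C) 2024 <copyright holder>\n\
                      // SPDX-License-Identifier: GPL-3.0-only\n\n";
const MARKER: &str = "// Copyright";

/// Prepends HEADER to the file unless it already starts with MARKER.
/// Returns true if the file was modified.
fn add_header(path: &Path) -> io::Result<bool> {
    let contents = fs::read_to_string(path)?;
    if contents.starts_with(MARKER) {
        return Ok(false);
    }
    fs::write(path, format!("{HEADER}{contents}"))?;
    Ok(true)
}

/// Recursively applies `add_header` to every `.rs` file under `dir`.
fn add_headers_recursively(dir: &Path) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            add_headers_recursively(&path)?;
        } else if path.extension().and_then(|e| e.to_str()) == Some("rs") {
            add_header(&path)?;
        }
    }
    Ok(())
}
```

Checking `starts_with(MARKER)` keeps the tool idempotent, so it can also run in CI as a lint.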
Codecov Report: ❌ The project check failed because the head coverage (74.08%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

@@            Coverage Diff             @@
##           develop    #6762      +/-   ##
===========================================
+ Coverage    67.63%   74.08%    +6.45%
===========================================
  Files          586      587        +1
  Lines       362403   362732      +329
===========================================
+ Hits        245099   268747    +23648
+ Misses      117304    93985    -23319

... and 361 files with indirect coverage changes.
... instead of using unnamed tuples and long parameter lists.
This will allow us to output warnings if the (non-blocking) delivery gets too far behind, because we can tell how long it took between enqueuing the event and actually sending it. This commit adds another migration to said database, so I slightly refactored the migration code.
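The "how far behind" check can be as simple as stamping each payload when it is enqueued and comparing at delivery time. A minimal sketch, where the struct, function name, and threshold are all illustrative placeholders rather than the PR's actual code:

```rust
use std::time::{Duration, Instant};

// Illustrative threshold -- not a value taken from the PR.
const LAG_WARN_THRESHOLD: Duration = Duration::from_secs(30);

struct QueuedPayload {
    id: u64,
    enqueued_at: Instant, // stamped when the payload is enqueued
}

/// Returns the enqueue-to-send lag if it exceeds the warning threshold.
fn delivery_lag_warning(payload: &QueuedPayload) -> Option<Duration> {
    let lag = payload.enqueued_at.elapsed();
    if lag > LAG_WARN_THRESHOLD {
        // Real code would emit a warn-level log mentioning payload.id here.
        Some(lag)
    } else {
        None
    }
}
```

Using `Instant` rather than wall-clock time keeps the measurement monotonic, so clock adjustments can't produce spurious warnings.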
98b3395 to 0d62bd4
This commit is the main implementation work for stacks-network#6543. It moves event dispatcher HTTP requests to a separate thread. That way, a slow event observer doesn't block the node from continuing its work. Only if your event observers are so slow that the node is continuously producing events faster than they can be delivered, will it eventually start blocking again, because the queue size for pending requests is bounded (at 1,000 right now, but I picked that number out of a hat, happy to change it if anyone has thoughts).

Each new event payload is stored in the event observer DB, and its ID is then sent to the subthread, which will make the request and then delete the DB entry. That way, if a node is shut down while there are pending requests, they're in the DB ready to be retried after restart via `process_pending_payloads()` (which blocks until completion). So that's exactly as before (except that previously there couldn't have been more than one or two pending payloads).
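The shape described above maps naturally onto `std::sync::mpsc::sync_channel`. A minimal sketch, assuming made-up names (`spawn_worker`, `deliver`) rather than the PR's actual API:

```rust
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread;

// The bound mentioned in the PR description.
const QUEUE_SIZE: usize = 1_000;

fn deliver(_payload_id: u64) {
    // Real code: load the payload from the event observer DB, POST it to
    // the observer over HTTP, then delete the DB row.
}

/// Spawns the worker thread and returns the sending half of the bounded
/// channel. `send` only blocks once QUEUE_SIZE payload IDs are in flight.
fn spawn_worker() -> SyncSender<u64> {
    let (tx, rx) = sync_channel::<u64>(QUEUE_SIZE);
    thread::spawn(move || {
        // The loop ends once all senders are dropped.
        for payload_id in rx {
            deliver(payload_id);
        }
    });
    tx
}
```

Sending only the ID over the channel (not the payload itself) is what makes the DB the source of truth: a crash loses nothing, because undelivered rows are still on disk.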
This fixes [this integration test failure](https://github.com/stacks-network/stacks-core/actions/runs/20749024845/job/59577684952?pr=6762), caused by the fact that event delivery wasn't complete by the time the assertions were made.
Doing this work in the RunLoop implementations' startup code is *almost* the same thing, but not quite, since the nakamoto run loop might be started later (after an epoch 3 transition), at which point the event DB may already have new items from the current run of the application, which should *not* be touched by `process_pending_payloads`. This wasn't a problem before, but now that the DB backs the actual queue of the (concurrently running) EventDispatcherWorker, it has become one.
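One way to make the "don't touch newer rows" constraint concrete is an ID watermark captured before the worker starts. This is purely an illustration of the invariant, with a `Vec` of `(id, payload)` rows standing in for the event observer DB; it is not the PR's actual code or necessarily its chosen fix:

```rust
/// Drains rows whose ID is at or below `watermark` (i.e. rows left over
/// from a previous run), leaving rows enqueued by the concurrently
/// running worker untouched. Returns how many rows were processed.
fn process_pending_payloads(rows: &mut Vec<(u64, String)>, watermark: u64) -> usize {
    let before = rows.len();
    rows.retain(|(id, _payload)| {
        if *id <= watermark {
            // Real code: send the HTTP request, blocking until it
            // succeeds, then delete the row from the DB.
            false
        } else {
            true
        }
    });
    before - rows.len()
}
```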
This is like 72437b2, but it works for all the tests instead of only the one. While only that one test very obviously failed, the issue exists for pretty much all of the integration tests, because they rely on the test_observer to capture all relevant data. Things are usually fast enough that we've only seen one blatant failure so far, but (1) it's going to be flaky (I can create a whole lot of test failures by adding a small artificial delay to event delivery), and (2) it might actually be *hiding* test failures (in some cases, e.g. neon_integrations::deep_contract, we're asserting that certain things are *not* in the data, and if the data is incomplete to begin with, those assertions are moot).
478efa3 to d5fa2fc
When switching runloops at the epoch 2/3 transition, this ensures that the same event dispatcher worker thread is handling delivery, which in turn ensures that all payloads are delivered in order.
As per this thread: stacks-network#6795 (review), I used the same table/column name and semantics that we use elsewhere for the same purposes. Also fixed a comment typo.
…event-dispatcher-tweaks
Thanks Hank for the tip!
Not sure why this wasn't caught in the pre-commit hook; I'd have assumed the checks are the same.
The logic is slightly tricky here, because the size of the queue (the max number of in-flight requests before we start blocking the thread) is implemented through the `bound` parameter of the `sync_channel`, but those two values aren't actually the same. See the comment at the top of `EventDispatcherWorker::new()` for details.
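The authoritative explanation lives in that comment, but one common reason the two values differ in designs like this is that the worker holds one message it has already received while its request is in flight, on top of whatever sits in the channel. A hypothetical mapping (my assumption, not the PR's actual formula) might look like:

```rust
/// Maps a configured queue size (max in-flight requests before the
/// enqueuing thread blocks) to the `bound` passed to `sync_channel`.
/// Hypothetical off-by-one: the worker itself holds one dequeued message.
fn channel_bound(queue_size: usize) -> usize {
    // A queue size of 0 means "no in-flight requests allowed":
    // sync_channel(0) is a rendezvous channel, where every send blocks
    // until the worker receives, so dispatch is effectively synchronous.
    queue_size.saturating_sub(1)
}
```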
See the discussion in stacks-network#6543 for some background.
/// to `true`, as no in-flight requests are allowed.
/// ---
/// @default: `1_000`
pub event_dispatcher_queue_size: usize,
This value ultimately ends up as the `bound` argument to `sync_channel`, which is of type `usize`.
I don't know if we have any concerns about having a platform-dependent number type on the configuration object -- if yes, we can also use something else here. Arguably, any reasonable value for this setting should fit into 16 bits anyway. If you need your queue to be bigger than 64k, you should reconsider your architecture.
addresses #6543
Note
This should be ready for an initial review, but note that this PR also includes the commits from #6795. Once #6795 is merged, they'll disappear from here.
This is the true diff between the two branches.
Checklist
- Documentation (`docs/rpc/openapi.yaml` and `rpc-endpoints.md` for v2 endpoints, `event-dispatcher.md` for new events)
- New clarity functions have corresponding PR in `clarity-benchmarking` repo